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We present a novel nonparametric Bayesian approach based on 
Levy Adaptive Regression Kernels (LARK) to model spectral data 
arising from MALDI-TOF (Matrix Assisted Laser Desorption Ioniza- 
tion Time-of-Flight) mass spectrometry. This model-based approach 
Q^ , provides identification and quantification of proteins through model 

■^^ • parameters that are directly interpretable as the number of proteins, 

mass and abundance of proteins and peak resolution, while having the 
j^ ■ ability to adapt to unknown smoothness as in wavelet based meth- 

ods. Informative prior distributions on resolution are key to distin- 
guishing true peaks from background noise and resolving broad peaks 
into individual peaks for multiple protein species. Posterior distri- 
butions are obtained using a reversible jump Markov chain Monte 
Carlo algorithm and provide inference about the number of peaks 
(proteins), their masses and abundance. We show through simula- 
tion studies that the procedure has desirable true-positive and false- 
\J~>. . discovery rates. Finally, we illustrate the method on five example 

spectra: a blank spectrum, a spectrum with only the matrix of a low- 
molecular-weight substance used to embed target proteins, a spec- 
trum with known proteins, and a single spectrum and average of ten 
spectra from an individual lung cancer patient. 
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1. Introduction. Recent innovations in protein separation methods, ion- 
ization procedures and detection algorithms have led mass spectrometry 
(MS) to play a vital role in the explosive growth of proteomics [Dass (2001), 
page xxi]. Despite technological advances in data collection, it remains chal- 
lenging to extract biologically relevant information (such as biomarkers) 
from MS spectral data [Dass (2001), Chapters 3 and 5; Coombes et al. 
(2005a); Baggerly, Morris and Coombes (2004, 2006); Clyde, House and Wol- 
pert 2006; Morris et al. (2006)]. 

Identifying peak locations (which represent proteins) and quantifying pro- 
tein abundance in spectra is often preceded by a multi-stage analysis involv- 
ing calibration, normalization, baseline subtraction and filtering of noise 
[Morris et al. (2005); Tibshirani et al. (2004); Yasui et al. (2003)]. A prob- 
lem with such an approach is that each individual step may introduce errors, 
artifacts or biases that may interfere with later stages of the analyses such 
as classification of subjects or identification of biomarkers. Methods that 
model background, noise and features simultaneously may lead to improved 
classification or inferences [Coombes et al. (2005b)]. Nonparametric methods 
such as wavelet regression have proved successful in simultaneously model- 
ing background and denoising, allowing one to extract features or regions of 
spectra that differentiate groups [Yasui et al. (2003); Coombes et al. (2005b); 
Wang, Ray and Mallick (2007); Morris et al. (2008)]. While wavelets are well 
suited for modeling local features like spectral peaks, neither the scales and 
locations that index basis functions nor coefficients used in the wavelet rep- 
resentation of expected intensity have any inherent biological interpretation 
for the typical wavelets used in practice, such as Daubechies' "least asym- 
metric" la8 wavelet family. As an alternative to nonparametric regression 
for modeling intensities, Guindani et al. (2006) and Miiller et al. (2010) 
developed a Bayesian mixture model based on beta distributions to esti- 
mate a density function for time-of-flight. The parameters of this model are 
more interpretable than the wavelet regression methods, but the approach 
does not incorporate information about peak resolution, which we will show 
allows one to resolve broad peaks into multiple peaks. 

In this paper we propose a novel nonparametric method employing Levy 
Adaptive Kernel Regression (LARK) models [Wolpert, Clyde and Tu (2011); 
Clyde and Wolpert (2007)], which retains the adaptivity and flexibility that 
make wavelet and other nonparametric methods appealing but, in contrast 
to these other methods, uses model parameters with direct biological inter- 
pretations. This offers the opportunity to elicit meaningful expert opinion to 
guide the selection of prior distributions, and a posteriori to provide posterior 
distributions on model parameters that are meaningful to the expert. The 
model presented in this article is intended for use with a "single" spectrum, 
although an entire set of exchangeable spectra may be analyzed by applying 
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the method to the mean spectrum, similar to the mean-spectrum undeci- 
mated wavelet thresholding (MUDWT) method of Morris et al. (2005). 

The paper is arranged as follows. We begin in Section 2 with a brief 
overview of MALDI-TOF mass spectrometry. In Section 3 we develop a sta- 
tistical model for protein abundance as a function of time-of-flight based 
on the recently developed nonparametric LARK models. Prior distributions 
for the model parameters are developed in Section 4 from elicited expert 
knowledge about the MALDI-TOF procedure and from exploratory anal- 
ysis of MALDI-TOF data from related experiments. Inference about pa- 
rameters of clinical interest are obtained from posterior distributions and 
are described in Section 5. We then illustrate the methodology in Section 6 
using a series of experiments with real data. We use laboratory data from 
three experiments with known sources: blank spectra, which help to char- 
acterize noise in MALDI-TOF experiments; matrix spectra, which help to 
characterize background; and a known protein mixture, which illustrates the 
model's ability to resolve masses. The LARK model is also applied to single 
and mean spectra from a recent lung cancer study conducted at the Duke 
University Medical Center. In Section 7 we the compare the LARK model 
to the wavelet method of Morris et al. (2005) using mean spectra from the 
TOF simulator of Coombes et al. (2005b). We conclude with a discussion 
and suggestions for future work in Section 8. 

2. MALDI-TOF data. In Matrix Assisted Laser Desorption Time-of- 
Flight Mass Spectrometry, or MALDI-TOF MS, inference about the molec- 
ular composition of a sample is based on indirect measurement of molecular 
masses. Molecules are initially embedded in a matrix of low molecular weight 
substance, such as sinapinic acid, and placed on a metal plate. The molecules 
are then simultaneously dislodged (by vaporizing the substrate) and ionized 
(by removing one or more electrons from molecules) by a series of laser 
pulses. The now-charged molecules are accelerated by a strong electric field 
toward a detector. In the Applied Biosystems Voyager DE Biospectrome- 
try Workstation [Applied Biosystems (2001)], a linear detector measures ion 
abundance over time, then sends a signal at regular time intervals (clock 
ticks) to a digitizer for conversion to measured intensities; for the examples 
considered below, samples are taken at regular 4 ns or 16 ns intervals. The 
reported intensity in each time interval is typically the aggregate sum over 
several repeated laser "shots," leading to what we will refer to as a sin- 
gle spectrum, with response ion intensity at corresponding time-of-flights 
(TOFs). 

Figure 1 illustrates serum protein spectra from a single individual with 
lung cancer from a study conducted at the Duke Medical Center Radiology 
Department [Wang et al. (2003)]. Each serum sample was separated into 20 
fractions along a pH gradient prior to the MALDI-TOF analysis to reduce 
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Fig. 1. Single spectrum (a) and mean of ten spectra (b) from a lung cancer patient. 



saturation of the signal. Ten replicated spectra were obtained for each frac- 
tion, each with ten laser shots, using Voyager with a sinapinic acid matrix. 
For our analyses we randomly selected one fraction from one subject; Fig- 
ure 1 shows the single spectrum from the chosen subject-fraction (a), and the 
mean spectrum obtained by averaging the intensities of all ten replicates (b) 
from the same fraction. Morris et al. (2005) suggest using the mean spectrum 
as a way to reduce noise from various sources as discussed below. 

Distance traveled under constant acceleration is a quadratic function of 
time, leading to a simple but nonlinear relationship between TOF and the 
molecules' masses and ionic charge (the latter two enter only through their 
quotient, the mass to charge ratio m/z [Coombes et al. (2005a)]). Under 
ideal conditions the TOF spectrum would show a spike at the TOF corre- 
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sponding to each molecular species present. In actual MALDI-TOF spectra 
(Figure 1) we observe irregular peaks rather than one-dimensional spikes 
because molecules of equal size and charge do not all reach the detector at 
the same time. The most important of the many causes of TOF dispersion 
is variability in the amount of ionizing laser energy received by molecules of 
varying location within the matrix; those further from the matrix surface or 
from the center of the laser pulse may receive less kinetic energy and thus 
have lower initial velocities than similarly-sized molecules located closer to 
the center, delaying their arrival at the detector. Molecules may exchange 
energy in collisions, and may lose or gain mass through fragmentation and 
agglomeration, respectively. All these lead to TOF variation for each molec- 
ular species [Coombes et al. (2005a); Zhigilei and Garrison (1998); Franzen 
(1997)]. Furthermore, while the abundance of a protein in the sample ide- 
ally corresponds to the area under the TOF distribution curve, factors such 
as ion suppression, multiple charged ions, adducts (the addition of other 
molecules to the protein), protein-protein interactions and isotope distribu- 
tions may result in a protein being represented by more than one peak in 
the spectrum. 

The interpretation and analysis of MALDI-TOF data are complicated 
by several other sources of variation described by Morris et al. (2005) and 
Coombes et al. (2005a). In addition to measurement error which may mask 
or distort protein peaks, at least three other sources complicate the compar- 
ison or synthesis of multiple spectra: calibration (uncertainty in the conver- 
sion of TOF to m/z, including variable latency that affects time registration) ; 
background (a constant or even time- varying trend in the overall level); and 
scale (caused by many things including variability of laser intensity). One 
way to accommodate this is to construct models for peak identification and 
quantification that incorporate these recognized sources of variability, as in 
the wavelet approach of Coombes et al. (2005b), Morris et al. (2005) devel- 
oped for calibrated spectra. Our approach, based on kernel models, has the 
added advantage that model parameters have direct physical interpretations. 

3. A model for MALDI-TOF. To reduce the variability attributable to 
differing numbers of laser shots and differing baselines, we first standardize 
the spectrum and model 

, N Y, ob - min(Y ob ) 
(3.1) Yt = - j-± 

for a raw spectrum Y ob = {5^ ob } for Tq <t <T\, where To and T\ correspond 
to the range of TOFs of scientific interest and / is the number of laser shots 
summarized by Y ob . The initial molecular velocities are expected to be ap- 
proximately Gaussian in distribution [Dass (2001), page 75]. This and the 
dynamics of the MALDI-TOF process [Coombes et al. (2005a)] suggest that 
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TOFs for a single isotopic peak will also have symmetric bell-shaped distri- 
butions in the time domain, leading us [and others — see Morris et al. (2005); 
Malyarenko et al. (2005)] to prefer time of flight (TOF, in /is) rather than 
mass-to-charge ratios (m/z, in Da/e) for spectral modeling (although we 
follow convention in reporting and plotting results results in m/z). Because 
ions from the matrix may saturate the detector at initial TOFs [Malyarenko 
et al. (2005)] and masses less than 2 kDa were not of scientific interest to our 
collaborators, To will correspond to the TOF of a mass of 2 kDa throughout, 
unless otherwise noted. While the nonparametric model that we propose can 
accommodate an arbitrary lower bound (even To = 0), modeling these extra 
initial peaks will increase the running time of the algorithm, with little or no 
improvement in peak identification or model fit for the rest of the spectrum. 

3.1. Peak shape. The shape of a symmetric isotopic peak may be rep- 
resented by a probability density function for TOF t with parameters gov- 
erning the protein peak's location r and width uj. Examples include the 
Gaussian 

(3.2) k(t;r,u)= exp(-|t-r| 2 /2^ 2 ) 



and Cauchy (sometimes called Lorentzian in the MS literature) 

(3.3) k(t;r,u)= 2 1^, 

TT{LO Z + \t — Tf) 

as in Dass (2001), page 75, Kempka et al. (2004), and Applied Biosystems 
(2001), pages 6-30. 

A protein signature associated with J peaks may now be represented as 
a sum 

J 

(3.4) f{t) = ^2k(t;r j ,co j )r ]j , 

i=i 

where {tj}, {uij} and {r/j} represent the unknown location (TOF), peak 
width and abundance of the jth protein (or molecule), respectively. This is 
a special case of the Levy Adaptive Regression Kernel (LARK) models of 
Wolpert, Clyde and Tu (2011), Clyde and Wolpert (2007), which generalize 
classical kernel regression [Wand and Jones (1995)] by allowing the number 
of kernels and the "smoothing parameters" (uj) of the kernel k to adapt to 
the unknown degree of smoothness in the data, as in wavelet models [Morris 
et al. (2005)]. While both methods lead to excellent function reconstructions, 
the parameters in the kernels (r,- , ujj ) and kernel coefficients rjj of the LARK 
model have direct biological interpretations which aide in prior specification 
(detailed below) and posterior interpretation. 
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3.2. Peak width and resolution. Protein peaks tend to be broader for 
late-arriving molecules than for earlier ones, with width nearly proportional 
to arrival time [Siuzdak (2003), page 44]; for this reason it is conventional 
in mass spectrometry to quantify the precision (narrowness) of a kernel 
k(-;r, oo) not by the scale uj, but by the resolution 



(3.5) 



p = t/At, 



where Ar, the so-called full width at half mass or FWHM, is the width of the 
kernel k(-;r,uj) at half its height [Dass (2001), page 120]. For a symmetric 
kernel, Ar is the solution of the equation 

k(r± iAr;r,w) = ^k(T;T,u). 



For the Gaussian and Cauchy kernels we have Ar = 2u; v / Iog4 and Ar = 2qj, 
respectively, leading to ui = oj{r,p) with 



(3.6) 



u(r,p) 



2pVIog4 



and 



2p 



for the Gaussian and Cauchy kernels, respectively. Prior knowledge about 
resolution can be used to resolve the ambiguity illustrated in Figure 2, where 
the observed spectrum may arise from either a single wide peak or a pair 
of nearby narrower peaks. As depicted later in Figure 6(d)-(f), we illustrate 
how the model is able to "deconvolve" a wide peak into several individual 
protein peaks in real data. 

3.3. Background noise sources. Even in the absence of any protein mole- 
cules [i.e., with f(t) = 0], the MALDI-TOF spectrum does not vanish. Fig- 
ure 4(a) shows the nearly-constant level of thermal noise from a run with 
an empty plate, while Figure 4(b) illustrates the rapidly-decreasing signal 
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Fig. 2. The (nearly indistinguishable) lines represent simulated protein signals from 
a sample with either one wide peak (solid) centered at 120 fis or a mixture (dashed) of two 
narrow peaks centered at bold ticks. 
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with only the sinapinic acid matrix, showing the early arrival at the detector 
of ionized matrix molecules (far lighter than typical proteins under study). 
Since the signal from matrix ions dominates the thermal noise or detec- 
tor "ringing," together these sources contribute a background that falls off 
nearly exponentially to a nonzero asymptote. 

Exploratory analysis suggests that the matrix molecular signal fio(t) can 
be modeled adequately as a constant (see below) plus an exponential func- 
tion, 

(3.7) /3 (t) = k (t;u )7]o = — exp{-t/w }l{t>o}> 
with a characteristic decay time of ujq > and intensity rjo > 0. 

3.4. Expected spectral intensity. To reflect all of these features, we model 
the expected spectral intensity as 

(3.8) tit) = <iO--S) + S\f(t) + Po(t)]} 

for an overall scale £, a dimensionless signal-to-background ratio S € [0, 1], 
the protein signal f(t) from equation (3.4), and the matrix molecular sig- 
nature /3o(t) from equation (3.7). The term S represents the proportion of 
observed intensity produced by molecular signal (both matrix and protein), 
rather than by thermal noise. 

3.5. Likelihood. Both gamma and log-normal distributions are commonly 
used to model positive data like the standardized responses Yj. The variance 
is proportional to the mean for gamma distributions and to the square of 
the mean for log-normals. Exploratory data analysis (from both a Box-Cox 
approach and a robust regression illustrated in Figure 3) suggests that the 
conditional variance of standardized MS data Yt, given the mean, is nearly 
proportional to the first power of the mean, supporting the gamma model 

(3.9) YM-),^ d G a (^(t),^), 

with mean fj,(t) and relative precision parameter ip (similar relationships hold 
for the other data sets, although the slopes vary). This leads to a measure- 
ment-error model with likelihood function 

n 

(3.10) C(9;Y) = l[Ga(Y t .;w(t i ),<p) 

1=1 

for the parameter vector 9 comprising the conditional mean function /x(-) 
[or, equivalently from equation (3.8), all of £, J, {tj,u)j,t)j}i<j<j, S, ljo, 
and r/o] and (p. Here Y = {Y(ti)}i<i<n represents the vector of standardized 
intensities from (3.1), and Ga(y;a,/3) = r7^y2/ Q_1 e _ ^ / l{ 2/ >o} is the probabil- 
ity density function at y G R for the gamma Ga(a,/3) distribution. 
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Fig. 3. Linear (solid) and quadratic (dashed) fits of variance versus mean intensity of 
intensity for 200 [is blocks of observations from a single spectrum from a lung cancer 
patient. 



Typically the likelihood function of equation (3.10) has many modes be- 
cause it is difficult to distinguish wide peaks from clusters of narrow ones, or 
small peaks from noise, from the data alone. Estimating 6 (and in particu- 
lar J, the number of protein peaks) by direct maximization of the likelihood 
leads to over-fitting the data and to over-estimating J. This can be overcome 
by regularization or by a Bayesian approach, in which prior distributions ef- 
fectively penalize overly complex models. 



4. Prior distributions for MALDI-TOF. We now address the problem of 
constructing a joint prior distribution for all the unknown parameters of the 
model of Section 3, 



(4.1) 



n(t) = ({(l-S) + S[f(t)+f3 (t)]}, 
J 

Po(t) = — exp{-(i-T )/u;o}l{t>T }- 

We discuss the approach we used to construct the distributions, employing 
public knowledge where possible or default procedures otherwise. Parameter 
values used for the different data sets are summarized in Table 1. 
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Table 1 
Prior hyperparameters for each of the real and simulated data sets 



Prior 






Date 


i set 






hyper- 








Single 


Mean 


Simulation 


parameters 


Blank 


Matrix 


Known 


lung 


lung 


study 


/'■/ 


20 


20 


20 


100 


100 


150 


A 


0.03 


0.03 


0.03 


0.11 


0.11 


0.65 


e 


0.79 


0.79 


0.79 


0.21 


0.21 


0.03 


T 


5.58 


5.58 


5.58 


0.03 


0.03 


13.47 


Tx 


217.14 


217.14 


217.14 


278.04 


278.04 


82.78 


°l 


0.1225 


0.1225 


0.1225 


0.1225 


0.1225 


0.1225 


/'p 


200 


200 


200 


200 


200 


300 


-I 


0.49 


0.49 


0.49 


0.49 


0.49 


0.49 


"v 


0.25 


0.25 


0.25 


0.25 


0.25 


- 


b v 


0.02 


0.11 


25.91 


1.14 


0.17 


- 


as 


1.36 


8.33 


8.71 


9.46 


7.29 


6.33 


b s 


1 


1 


1 


1 


1 


1 


A„ 


0.0009 


0.0005 


0.0012 


0.0006 


0.0004 


0.0124 


U>0 


171.68 


290.14 


127.30 


210.96 


268.52 


11.22 


cr Uo 


0.25 


0.25 


0.25 


0.25 


0.25 


0.25 



4.1. Measurement error ip and overall level £• The exploratory data 
analysis of experimental spectra (see Section 3.5) suggests that sample vari- 
ances of {Yt} are nearly proportional to the mean. We chose a gamma prior 
distribution ip ~ Ga(a ¥ ,,fe ¥ ,) for the mean-to-variance ratio (p, centered at 
a data-based value but supporting a wide range of prior uncertainty. We 
binned the observations {Yt} of each data set in 50 //s-wide blocks and cal- 
culated their block-specific means and variances. The prior mean a^/b^ was 
set to the slope of a regression of the block means on block variances, with 
a v = 0.25 to attain a coefficient of variation of 2. 

The parameter £ may be interpreted as the mean level or scale for Yt , since 
E[/(£)] ~ 1 (see Section 4.2). Since experimental levels depend on a wide 
range of exogenous variables and vary widely among trials, it is difficult to 
elicit a subjective prior distribution for this quantity. We instead employed 
a data-dependent rescaling and set £ = Y, the overall mean intensity. This 
is comparable to rescaling the raw data by the average or total intensity. 
Sensitivity analysis showed that this gave results very similar to those under 
a tight data-dependent prior distribution. 

4.2. Prior distribution for protein signature /(•). We specify a prior dis- 
tribution for the protein signature 



f(t) = ^2k(t;T j ,u> j )n j 
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by first specifying a distribution for J and then, conditional on J, tak- 
ing {tj,ujj,t]j} to be i.i.d. from a specified joint distribution [as in the 
infinitely divisible construction of the LARK models of Wolpert, Clyde 
and Tu (2011)]. To reflect uncertainty in the possible number of peaks, 
Wolpert, Clyde and Tu used a negative binomial distribution NB(aj,/ij) 
for J with mean and shape parameters p,j and aj. To simplify elicitation 
while providing robust inference over a range of spectra, we set aj = 1 
throughout, leading to a geometric distribution: 

P[J = 3\ M ]=(-L-)(-^-)\ j €{0,1,...} 

with mean \ij representing the expected number of peaks for a given exper- 
iment. Campa et al. (2003) found approximately fifty proteins for fraction- 
ated samples similar to the single and mean spectra described in Section 6.4, 
leading to perhaps seventy or so peaks due to adducts, multiply charged 
ions, etc. On this basis we chose \xj = 100 with a median of J ~ 70 peaks 
with symmetric 50%, 90% and 99% ranges of approximately 30 < J < 140, 
5 < J < 300, and 0.50 < J < 532.5, respectively. For the blank, matrix and 
known protein spectra, we set pj = 20. 

There is little reason to give higher prior probability to one range of TOFs 
than another without prior knowledge of the collection of proteins present 

in the samples. Thus, we take {tj}i<j<j ~' Un(To,Ti) (independently of 
J and {Aj}i<j<j), for some interval of length T = T\ — To, large enough 
to include the TOF for all molecules of interest. To eliminate saturation by 
matrix molecules at the low end, and to include as wide as possible a range of 
the biologically relevant molecules, we chose a TOF interval corresponding 
to the range [2 kDa/e < m/z < 75 kDa/e] for all experimental data. Differing 
sampling rates and calibration levels for different experiments lead the TOF 
ranges [To,Ti] to vary across experiments (see Table 1). 

We use expert opinion to construct an informed prior distribution on the 
resolutions {pj}i<j<j (see Section 3.2), which induces a distribution on the 
peak widths {^j}i<j<j- It has been suggested [Siuzdak (2003), page 44] that 
individual peak resolutions pj should be nearly constant across the entire 
TOF range, but in practice they are observed to vary [see pages 6-32 of 
Applied Biosystems (2001)]. To reflect this variation, we construct a hierar- 
chical prior distribution for the resolution parameters {pj}i<j<j as follows. 
Independently of the number of peaks, TOFs and abundance, we take the 
resolutions to have log-normal distributions, 

^og(Q)\fi g ,a 2 e ~ N(log(p s ),a 2 g ), 
log(pj)\Q,a 21 ~N(log(g),a 2 ), 
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centered at an overall "experiment" resolution g (which may be machine- 
or condition-specific). We use the manufacturer's reported resolution ranges 
[Applied Biosystems (2001), Table 6-2 and Table H-6] for the Voyager work- 
station and set \x Q = 200 and set a 2 Q = 0.49, so a priori the distribution covers 
the range 50-800 with 95% probability, and 32-928 with 99% probability. 
The standard deviation for the individual resolutions was set to a p = 0.35, 
seventy percent of the population standard deviation for resolution. For 
a population resolution of p e = 50, this leads to a prior 99% interval for indi- 
vidual resolutions of (20, 120), while at the upper extreme with a population 
resolution of p, Q = 800, the prior 99% interval covers the range (325, 1,710). 
Finally, the relationship between width, TOF and resolution given by equa- 
tion (3.6) induces a log-normal prior distribution on the width parameters, 

l0g(Uj)\Tj,Pj ~ fi(log(Tj/cpj), of), j = l,..., J, 



with c = 2 for the Cauchy kernel and c = 2-^/log4 for the Gaussian. 

For protein abundances {r]j} we use the left-truncated gamma distribution 
Ga(0, A,e) with parameters a = and A,e (chosen below), whose density 
function is given in general by 

(4.2) Ga(7/;a,A,e) = fT^^" 16 "^ 1 !^}' 

where T(a, x) — f°° z a ~ 1 e~ z dz denotes the complimentary incomplete gam- 
ma function [Abramowitz and Stegun (1964), Section 6.5.3]. For a, A > this 
is the conditional density for a gamma-distributed Ga(ct, A) random variable, 
given that it exceeds e > 0; for strictly positive e > 0, the distribution is well- 
defined for all a S ffi including the limiting case a = [Wolpert, Clyde and 
Tu (2011)], which we adopt. The mean is 

E[v] = a^eT(a7) ' 

where Ei(z) denotes the exponential integral function [Abramowitz and Ste- 
gun (1964), page 228]. The parameter e may be interpreted as the minimum 
detectable abundance. In the limit e — > 0, this distribution permits an in- 
creasing number of isotopes with small abundance, while reflecting that only 
a few isotopes are expected to have large abundance. 

Discussions with spectrometrists suggest that the smallest peak that can 
possibly be distinguished from noise is about 5-10% of the average signal. 
Using the midpoint of this interval, we take e/E[ry] = Xee Ei(Ae) = 0.075. 
For t well away from the boundary of [To,Ti], Jj, 1 k(t;Tj,Uj)d,Tj « 1 (the 
kernels are density functions), and 

(4 - 3 > ™» T^, M ' T » <<f<<T " 
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Because f(t) is the normalized signal, we take E[/(£)] = 1 a priori, leading 
to the solution e = 0.075T//jlj. Inverting the function Xee X£ Ei(Xe) = 0.075, 
we obtain Ae = 0.0227, which determines A for any specified e. Values for 
the different experiments are provided in Table 1. 

4.3. Prior distribution for matrix background. As with other peaks, we 
use a log-normal distribution for the initial peak width ujq and a left-truncated 
gamma model for the abundance t]q. Because the exponential decay varies 
greatly from experiment to experiment, we utilize modestly informative pri- 
ors based on the data. To construct these, we first fit a linear regression 
with mean function log(/9o(t)) from equation (3.7) to the log intensities in 
an initial segment of the spectrum (2 kDa/e < m/z < 3.5 kDa/e). We use 
the slope and intercept from this fit to construct point estimates ujq and fjo 
to center the prior distributions 

log(w )~N(log(tJ5),cr^, ) 

with <t^ =0.25 and 

?7o~Ga(0,Ao,£), 

where Ao is the solution to (Aoe eA °Ei(eAo)) _1 =rjo- 

Finally, for the signal fraction S we use a beta prior distribution 

S ~Be(as,&s), 

with data-based mean as /(as + 65) = 1 — Y N jY and b$ = 1 (here Y N is 
the observed mean intensity in a "noise" region of the spectrum with low 
intensity and no apparent peaks, while Y is the overall mean intensity). With 
this choice the mode of the prior density for S is one whenever Y > 2Y , 
suggesting that the signal dominates, and zero when the mean in the noise 
region exceeds half the overall mean, favoring the thermal noise component 
over the nonparametric signal model. 

5. Posterior analysis. To support inference about protein location and 
abundance, and about other model parameters, we construct an ergodic 
Markov chain on the space G of possible parameter vectors 9 = {£, J, {tj,ujj, 
r]j}i<j<j,S,(uJo,r)o),\j,p} with the posterior distribution as its stationary 
distribution. At each Markov chain step we select one of the components 
of and either update it via a Gibbs step (replace the current value with 
a draw from its complete conditional posterior distribution given the other 
components) or, if this is not feasible, a random-walk Metropolis-Hastings 
(M-H) step by proposing a small change in that component which is then 
accepted or rejected according to the Hastings probabilities. Note that each 
proposed change in J (which we always take to be a random- walk step of size 
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one) changes the dimension of 6 (by three) . Such dimension changing M-H 
algorithms, introduced in Green (1995), are called reversible jump MCMC 
(RJ-MCMC) algorithms. Our approach is modeled after that of Wolpert 
and Ickstadt (2004), where a general RJ-MCMC procedure for Levy ran- 
dom field models is presented. For updating the varying dimensional param- 
eters {Tj,u)j,r)j}x<j<j we consider the possible moves of three basic types: 
peak birth [incrementing J by one and introducing a new triplet (t*,w*,?7*)]; 
peak death [decrementing J by one and removing a randomly-chosen triplet 
(tj,(jJj,t)j)]; and peak update [moving a randomly-chosen triplet (tj,ojj,t]j) 
within M 3 ]. We also incorporate two additional move types, peak splitting, 
in which a single peak is replaced by a pair of smaller ones, and the reverse 
move, peak merging, in which two nearby peaks are replaced with a single 
larger one. These lead to a vast improvement in algorithmic efficiency over 
RJ-MCMC algorithms using only birth/death and update steps. 

For sufficiently large spectra or complex protein mixtures, convergence to 
the posterior distribution from random starting values may require upward 
of a million iterations. To reduce computation time, we begin the Markov 
chain close to a mode, located using an EM algorithm [Dempster, Laird and 
Rubin (1977)] for a simple Gaussian approximation to our LARK model; 
see House (2006) for details of this and the RJ-MCMC algorithm. 

5.1. Peak identification. Features of the configuration {(tj,ojj,t]j)}i<j<j 
are updated at each iteration of the RJ-MCMC sampler, with the number 
of peaks J and associated parameters changing. While the posterior mean 
of J, J PM , and the posterior mean function E[/z(i)|y] [see equation (4.1)] 
are well-defined quantities that may be used to summarize the RJ-MCMC 
output, the well known label switching problem complicates peak identifica- 
tion. We have two ways of identifying peaks in an RJ-MCMC run. The first 
is to use the J HP peak locations {r? p } in the single RC-MCMC iteration 
with highest posterior (HP) density, which is proportional to the product of 
the likelihood function and prior density evaluated at the parameter vector 
for that iteration. Alternatively, we may use model averaging to identity 
local maxima in the denoised signal by identifying the J down-crossings of 
the derivative (d/dt)E[[i(t)\Y] = E[/j,' (t)\Y]; these local modes {tJ} are the 
collection of J v solutions of 

(5.1) | E [ / ,( t _)|y ]>0 >^E[ / ,(t+)|y]. 

Because derivatives of densities may be used to construct wavelets, the im- 
plied LARK model for the derivative process actually uses a continuous 
wavelet dictionary based on the "Mexican Hat" (or Marr's) family [Vi- 
dakovic (1999), pages 48-49]. This method is closely related to the recent 
paper of Nguyen et al. (2010) who use zero-crossings of the derivatives of 
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Gaussian wavelets. Typically J is smaller than either J or J [one 
reason is that some pairs of peaks with small inter-peak distance \tj — Tfc| 
will combine to generate a single local maximum of //(£), but all the major 
peaks will be represented among the {r^ 7 }]. 

6. Examples. In this section we illustrate our method using five data 
sets: a blank spectrum, a matrix spectrum, a "spiked" spectrum from a sam- 
ple of known protein composition, and two spectra (one single and one mean) 
for a serum sample from a patient diagnosed with lung cancer. All data were 
generated using the Voyager DE spectrometer [Applied Biosystems (2001)] 
at Duke University. Prior hyperparameters were chosen as described in Sec- 
tion 4 and are given in Table 1 for the five data sets. All examples in this 
section use the Cauchy kernel, selected because of its better fit to similar 
data in a preliminary investigation. All RJ-MCMC were run for 500,000 iter- 
ations to ensure convergence (burn-in), with an additional 500,000 iterations 
used for posterior inference. 

6.1. Blank spectrum. Figure 4(a) shows the recorded spectrum from the 
average often blank-plate spectra each based on 32 laser shots, with the pos- 
terior mean E [//(£) |Y] shown as a solid curve. The two rows of tick marks on 
the horizontal axis represent peak locations identified using the highest pos- 
terior realization {rF } (top row) and local maxima under model averaging 
{r^ 7 } (bottom). The highest-posterior realization included J HP = 38 peaks, 

the local modes under model averaging included J = 22 peaks, while the 
posterior mean (and standard deviation) were J PM = E[J|Y"] = 46.92 (4.22). 
With no solution or matrix on the metal plate, there can be no protein sig- 
nature, but nevertheless the spectrum shows numerous low-resolution peaks. 
The posterior expected resolution in the blank spectrum was E[^|Y] = 16.76 
(2.92), lower than the informative prior mean and significantly lower than 
the typical resolutions for protein peaks in the other examples (see Table 2). 
These apparent peaks may reflect laser fluctuations or resonances in the de- 
tector. The number of periods and intensities may vary unpredictably across 
spectroscopic samples, so rather than use a harmonic function as in Harezlak 
et al. (2008), we instead allow the adaptive LARK model to identify and fit 
these low-resolution peaks as part of /(•)• They may be discriminated from 
protein peaks in post-processing by their lower resolution. 

6.2. Matrix spectrum. Figure 4(b) shows the average of ten spectra, each 
based on 32 laser shots from a sinapinic matrix solution containing no pro- 
tein serum sample. The posterior mean under the LARK model shows the 
characteristic near-exponential spectral fall-off arising from the very low- 
molecular-weight sinapinic acid matrix ions, as well as several low-resolution 
peaks [posterior mean for g = 15.03 (1-70)] comparable to those in the blank 
spectrum. 
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Fig. 4. 27ie mean spectra based on ten replicates for (a) an empty plate illustrating 
thermal noise and (b) the sinapinic acid matrix with no proteins. The posterior means 
from LARK are shown as solid curves with identified peak locations indicated as two rows of 
tick marks: highest posterior realization (top, gray) and local modes under model averaging 
(bottom, black). Note the difference in scale of Y axes. 



6.3. Known protein spectrum. Figure 5 shows the average of ten spectra 
(each with 32 shots) from a preparation of five proteins with known masses 
provided by Professor M. Fitzgerald in the Department of Chemistry at 
Duke. The five known masses of singly-charged molecules are indicated by 
solid triangles and the five peaks for doubly-charged molecules are indicated 
by open triangles; each open triangle for a doubly-charged peak lies at one- 
half the m/z value of the singly-charged peak for the same molecule. Finally, 
one triply-charged peak at 22.1 kDa/e (one-third the singly-charged value) 
is indicated by an inverted triangle. Peaks identified by our procedure are 
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Table 2 
Posterior means and (standard deviations) for model parameters for the five 

experimental data sets 



Data set 


S 


V 


Q 


■no 


LOO 


jPM 


J HP 


J v 


Blank 


0.5660 


7.385 


16.76 


38.59 


233.70 


46.92 


38 


22 




(0.0537) 


(0.126) 


(2.92) 


(11.14) 


(150.80) 


(4.22) 






Matrix 


0.9569 


6.770 


15.03 


163.70 


17.41 


12.27 


11 


7 




(0.0063) 


(0.121) 


(1.70) 


(2.63) 


(0.21) 


(0.45) 






Known 


0.9016 


2.609 


96.43 


78.26 


22.80 


42.24 


42 


28 




(0.0021) 


(0.093) 


(5.54) 


(2.87) 


(0.95) 


(0.96) 






Single 


0.9996 


0.310 


49.46 


182.30 


16.76 


52.68 


45 


45 




(0.0004) 


(0.005) 


(3.56) 


(1.77) 


(0.15) 


(2.65) 






Mean 


0.9846 


3.952 


69.35 


188.40 


15.77 


76.91 


74 


54 




(0.0031) 


(0.066) 


(3.55) 


(0.97) 


(0.07) 


(1.47) 







indicated by vertical tick marks; these include all eleven "true" peaks, plus 
several additional peaks. These may reflect contaminants, differential iso- 
topic compositions or thermal noise. Several of these identified peaks have 
resolutions in the range of the median resolution for the blank spectrum 
(Table 2), suggesting that the model is capturing the thermal noise com- 
ponent. The peak at 3.903 kDa (just below the smallest "true" peak) has 
higher resolution than typical thermal peaks, and also higher abundance. It 
is clearly present in all ten replicates, suggesting a potential contaminant 
in the mixture. Features of the LARK model, such as the resolution and 
abundance parameters, may aid in reducing false positives and prioritizing 
masses for further study, beyond using estimated mass alone. 

6.4. Lung cancer protein spectrum. Figure 6 displays posterior recon- 
structions from LARK for different segments of the single and mean spec- 
tra for the complete data depicted in Figure 1. The noise reduction from 
averaging several spectra results in higher estimated precision <p, just as 
one would expect (Table 2), approximately ten times higher than that for 
a single spectrum. The resolution is also higher in the mean spectrum, lead- 
ing to the identification of a larger number of peaks. Posterior means for 
other fixed dimensional summaries are comparable for the single and mean 
spectra. Figure 6(d)-(f) illustrates the ability of the LARK model to de- 
convolve a single large peak into several peaks. While there is a local mode 
at 33.5 kDa, corresponding to the doubly charged peak for albumin (mass 
67 kDa), the relationship between resolution and mass suggests that there 
are other molecules present that lead to the wider than expected peak and 
its asymmetric shape. The highest posterior realization from LARK provides 
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Fig. 5. LARK posterior mean (solid line) and data from a mixture of five known pro- 
teins (a) and associated posterior distribution of resolution (b) . Solid triangles represent 
singly-charged molecules, open triangles represent doubly-charged molecules and the in- 
verted triangle represents a triply-charged protein. The rows of tick marks represent iden- 
tified peak locations using the highest posterior draw (top, gray) and local modes under 
model averaging (bottom, black). The horizontal line in (b) corresponds to the median 
resolution from the blank spectrum analysis. 

a way to estimate these masses (in contrast, local modes of the estimated sig- 
nal would suggest just a single protein, while methods that identify peaks by 
regions where the estimated signal is greater than some specified threshold 
would report almost the entire range between 31-39 kDa). 

7. Simulation study. Coombes et al. (2005a) construct a mathematical 
model for MALDI-TOF mass spectrometry based on the physics of the pro- 
cess, providing a virtual mass spectrometer. Morris et al. (2005) use this 
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Fig. 6. Peak reconstruction using the LARK posterior mean (solid line) for the lung 
cancer patient using the single spectrum (left) and mean spectrum (right) for segments of 
the respective spectra. The two rows of tick marks on the horizontal axis correspond to 
identified peak locations using the local maxima from highest probability draw (top, gray) 
and model averaging (bottom, black). 

simulator to generate spectra to explore operating characteristics of their 
mean-spectrum undecimated wavelet threshold (MUDWT) peak detection 
method. They generated 100 data sets, each comprised of 100 simulated 
spectra generated with 150 true peaks and additive i.i.d. Gaussian errors 
(sd a = 66) [for details, see Section 4.3 of Morris et al. (2005); raw data and 
code are available on the Cromwell (2004) website]. Although in theory the 
simulated spectra should not need alignment before calculating mean spec- 
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tra, our exploratory analysis revealed boundary effects in the initial segment 
(below approximately 3 kDa) due to averaging, and visual peaks outside of 
the true peak mass ±0.3% tolerance window used by Morris et al. (2005) to 
classify matches of identified peaks to true peaks. 5 Thus, for this analysis we 
use mean spectra from each of the 100 data sets for all masses greater than 
3 kDa. In addition to the MUDWT approach of Morris et al. and LARK, 
we identified peaks using the R package PROcess [Li (2005)], which provides 
automatic baseline subtraction and peak identification. 

Because the simulated data were generated using Gaussian errors with 
constant variance, we replaced the gamma likelihood of the LARK model 
with a Gaussian likelihood, with mean n(t) and constant precision <p = 1/a 2 . 
Otherwise, all model parameters have the same interpretation as before. Pre- 
liminary investigation for a region of the spectra with a single known peak 
indicated that the Gaussian kernel provided a better fit than the Cauchy 
kernel we used for the experimental data. This also suggested that the sim- 
ulated spectra featured higher resolution than those of the Voyager ma- 
chine, leading us to select \i e = 300, keeping a e = 0.49 as before. Following 
Morris et al., we use a robust estimate of a from the wavelet decomposition 
as an alternative to the data-based prior described in Section 4 for the non- 
constant variance model. For all remaining parameters, we used the default 
methods described in Section 4 to specify hyperparameters (see Table 1). 
We ran the RJ-MCMC algorithm for one million iterations, half for burn-in 
and half for posterior inference. Peaks were identified using the local mode 
under model averaging and the highest posterior realization. 

For all methods, true peaks were classified as a true positive or true dis- 
covery if the mass of any identified peak was within ±0.3% of a true mass [as 
in Morris et al. (2005)]. Identified peaks outside of the ±0.3% tolerance win- 
dow of true peaks were regarded as false positives. The True Positive Rate 
(TPR), the proportion of true discoveries, and False Discovery Rate (FDR), 
the proportion of false positives out of all identified peaks, were calculated 
for each of the 100 simulated mean spectra and are summarized in Table 3. 
Overall, PROcess has the lowest average FDR of all methods — however, its 
TPR is much worse than either of the LARK or MUDWT methods, which 
estimate the baseline and denoise simultaneously. Peak identification using 
down-crossings under model averaging (LARK-MA) has the better FDR of 
the two LARK methods, but has a lower TPR because local modes may 
miss overlapping peaks. Peak identification using the LARK HP realization 
and the wavelet method provide comparable TPR and FDR performance, 
with the MUDWT method having a slightly better TPR, while the LARK- 
HP method has a slightly improved FDR. The absolute differences of TPR 
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Table 3 
Summaries of true positive rates (TPR) and false discovery rates (FDR) 

for 100 simulated mean spectra using the mean-spectrum undecimated 

wavelet transform (MUDWT) [Morris et al. (2005)], LARK with highest 

posterior realization (LARK-HP), LARK with local modes under model 

averaging (LARK-MA) and PROcess [Li (2005)] 

Summary Method Average 95% range 

TPR 



FDR 



MUDWT 


0.83 


0.77-0.89 


LARK-HP 


0.82 


0.75-0.89 


LARK-MA 


0.73 


0.61-0.81 


PROcess 


0.31 


0.10-0.43 


MUDWT 


0.06 


0.003-0.208 


LARK-HP 


0.05 


0.000-0.132 


LARK-MA 


0.01 


0.000-0.050 


PROcess 


0.0033 


0.000-0.032 



and FDR rates for the two methods are both about 0.01, "statistically sig- 
nificant" using a paired t-test but not practically significant, leading to the 
discovery of an extra 1-2 proteins by MUDWT (on average) at the cost 
of a comparable number of additional false positives. A further breakdown 
of the TPR for LARK by prevalence and abundance [as in Morris et al. 
(2005), Table 4] is provided in the supplemental materials [House, Clyde 
and Wolpert (2011)], with LARK-HP having substantially higher TPR than 
MUDWT for peaks in the higher prevalence groups across all abundance cat- 
egories, but poorer performance than MUDWT for the two lowest prevalence 
groups. 

8. Discussion. The Gaussian LARK model leads to true positive rates 
and false discovery rates comparable to adaptive nonparametric wavelet 
methods for simulated Gaussian intensities data, while the gamma LARK 
model is able to capture the mean/variance relationship that is observed in 
experimental data. Exploratory data analysis may be used to decide which 
model is more appropriate by examining the mean/variance relationship or 
residual analysis, with other error models easily substituted to define alter- 
native likelihood functions given the mean function. 

A key feature of the LARK methodology is the ability to deconvolve 
large peaks into mixtures of protein signatures. The model is able to iden- 
tify all masses for the laboratory experiment with known protein mixture, 
even though many of the doubly charged proteins have low abundance. To 
capture the multiply charged nature of proteins in MALDI-TOF even more 
effectively, LARK models may be constructed using a kernel tailored to this 
purpose as a mixture of two peaks centered at the singly and doubly charged 
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masses, with a mixing weight to control the relative abundance of the two 
charges. This constraint reduces the number of free parameters and may lead 
to a more efficient algorithm. The method described in this paper is intended 
for use with either a single spectrum or a mean spectrum. Hierarchical ver- 
sions of the LARK model for MALDI-TOF data are under development for 
modeling multiple spectra, which provide automatic calibration of multiple 
spectra and permit classification of subjects into groups. 

The LARK models are implemented in an R [R Development Core Team 
(2010)] package, with a shared library written in C and FORTRAN for the 
RJ-MCMC algorithm. Although the LARK model for peak identification is 
more computationally intensive than the wavelet method of Morris et al. 
(2005), with 10,000 iterations taking 10 minutes on a dual 3 GHz Quad 
Core Xeon Mac Pro for the simulation study (running on a single proces- 
sor), its running times increase only linearly with the number of peaks and 
volume of data, since no matrix inversion is required. Despite the compu- 
tational overhead of RJ-MCMC, Clyde and Wolpert (2007) and Wolpert, 
Clyde and Tu (2011) have demonstrated that LARK models can provide 
significant reductions in mean squared error in comparison with some of the 
best wavelet methods such as the nondecimated wavelet approach of John- 
stone and Silverman (2005) and the continuous wavelets of Chu, Clyde and 
Liang (2009). Future work will incorporate advances in adaptive MCMC 
methods which may accelerate convergence for random-walk update steps 
or lead to improved proposal distributions for peak birth based on residuals 
or peak death utilizing abundance. The software is available from the first 
author's website. 
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SUPPLEMENTARY MATERIAL 

Additional results for the simulation study 

(DOI: 10.1214/10-AOAS450SUPP; .pdf). True positive rates for LARK es- 
timates from the simulation study broken down by peak prevalence and 
average intensity of peaks across samples. 
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